Technical Details
Handling Hallucinations in LLMs
The Retrieval-Augmented Generation (RAG) pattern can be used to ground the LLM response in the context provided.
The context can be retrieved by searching the organization's knowledge base or document repository.
Moreover, citations to the referenced documents or articles can be provided in the final response.
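As a rough illustration of this pattern, the sketch below builds a grounded prompt from retrieved passages; search_knowledge_base and the prompt wording are hypothetical placeholders rather than the actual implementation.

```python
# Minimal sketch of grounding an LLM prompt in retrieved context (illustrative only).
def build_grounded_prompt(question, search_knowledge_base):
    # search_knowledge_base is a hypothetical retriever returning (doc_id, passage) pairs.
    hits = search_knowledge_base(question, top_k=3)
    context = "\n".join(f"[{doc_id}] {passage}" for doc_id, passage in hits)
    return (
        "Answer the question using ONLY the context below. "
        "Cite the document ids you used, e.g. [doc-1].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```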
Hallucination Metrics
Contextual Hallucination
Contextual hallucination highlights the limitations of LLMs in truly understanding context and generating responses that reflect genuine comprehension or knowledge.
Hallucination Score (Infosys)
The hallucination score is derived from the G-Eval metric values and from similarity scores computed across multiple categories: input prompt to output, input prompt to source, and output to source. The score is a weighted average of the scores from these categories.
We scale the average G-Eval score to the 0-1 range:
avgmetrics = ((faithfulness + relevance + adherence + correctness) / 4) / 5
When the average score of the G-Eval metrics (avgmetrics) is greater than 0.75:
Hallucination score = 1 - avgmetrics
When the average score of the G-Eval metrics is between 0.5 and 0.75:
We calculate the average prompt similarity score and classify the case based on the maximum prompt similarity score:
avgsimilarity = (inpoutsim + ressourcescore + inpsourcesim) / 3
maxscore = max(inpoutsim, ressourcescore, inpsourcesim)
if maxscore > 0.75:
Hallucination score = 1 - (avgsimilarity * 0.8) - (avgmetrics * 0.2)
if maxscore >= 0.5 and maxscore < 0.75:
Hallucination score = 1 - (avgsimilarity * 0.2) - (avgmetrics * 0.8)
if maxscore < 0.5:
Hallucination score = 1 - (avgsimilarity * 0.5) - (avgmetrics * 0.5)
When the average score of the G-Eval metrics is less than 0.5:
Hallucination Score = 1 - avgmetrics
Here, inpoutsim is the Input-Output similarity score, inpsourcesim is the Input-Source similarity score, and ressourcescore is the Output-Source similarity score.
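A minimal sketch of the logic above, assuming the four G-Eval metrics are on the raw 1-5 scale and the three similarity scores are already in the 0-1 range; variable names mirror the notation used here rather than any specific implementation.

```python
def hallucination_score(faithfulness, relevance, adherence, correctness,
                        inpoutsim, ressourcescore, inpsourcesim):
    # Average the four G-Eval metrics and rescale to 0-1.
    avgmetrics = ((faithfulness + relevance + adherence + correctness) / 4) / 5

    # Outside the 0.5-0.75 band, the score depends on the G-Eval average alone.
    if avgmetrics > 0.75 or avgmetrics < 0.5:
        return 1 - avgmetrics

    # Inside the band, blend in the prompt similarity scores.
    avgsimilarity = (inpoutsim + ressourcescore + inpsourcesim) / 3
    maxscore = max(inpoutsim, ressourcescore, inpsourcesim)

    if maxscore > 0.75:
        return 1 - avgsimilarity * 0.8 - avgmetrics * 0.2
    if 0.5 <= maxscore < 0.75:
        return 1 - avgsimilarity * 0.2 - avgmetrics * 0.8
    return 1 - avgsimilarity * 0.5 - avgmetrics * 0.5
```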
For the Non-RAG Scenario
We use a prompting technique to obtain the hallucination score based on relevance and factual accuracy. The score ranges from 0 to 1.
0 indicates that the answer is highly relevant to the prompt, realistic, free of factual errors, and not at all nonsensical.
1 indicates that the answer is unrelated to the prompt, highly implausible or unrealistic, completely factually incorrect, and highly nonsensical.
To keep the score easy to interpret, the evaluation prompt asks the model to avoid assigning a score of exactly 0.5.
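The sketch below shows the general shape of such an evaluation prompt; the wording and the llm_call helper are assumptions for illustration, not the exact prompt used.

```python
# Illustrative evaluation prompt for the non-RAG hallucination score (assumed wording).
HALLUCINATION_PROMPT = """You are an evaluator. Given a prompt and an answer, rate the
answer's hallucination on a scale from 0 to 1, where:
- 0: highly relevant to the prompt, realistic, no factual errors, not nonsensical
- 1: unrelated to the prompt, implausible, factually incorrect, nonsensical
Avoid assigning exactly 0.5. Respond with only the number.

Prompt: {prompt}
Answer: {answer}
Score:"""

def score_without_rag(prompt, answer, llm_call):
    # llm_call is a hypothetical callable that sends text to the chosen LLM
    # and returns its completion as a string.
    reply = llm_call(HALLUCINATION_PROMPT.format(prompt=prompt, answer=answer))
    return float(reply.strip())
```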
Perplexity Score
Perplexity is a measure of how well the model predicts the next word or character based on the context provided by the previous words or characters. The lower the perplexity score, the better the model's ability to predict the next word accurately.
For example, if a language model predicts that there is a 0.5 chance that the next word is "dog" and a 0.5 chance that it is "cat", the probability distribution would be [0.5, 0.5]. The geometric mean of these probabilities is the square root of their product, which in this case is 0.5. The perplexity score is the inverse of this value, i.e. 2. This means the model would be somewhat surprised to see either "dog" or "cat" as the next word given the preceding context. If the model were perfect and predicted the correct word with certainty, its perplexity score would be 1. If it performed poorly and assigned equal probability to every word in its vocabulary, its perplexity score would equal the vocabulary size.
Perplexity is calculated from the average cross-entropy over the data set, using the number of words in the data set and the predicted probability of each target word given its preceding context:
Perplexity = exp( -(1/N) * Σ_{i=1..N} log P(w_i | w_1, w_2, …, w_{i-1}) )
where N is the total number of words in the data set and P(w_i | w_1, w_2, …, w_{i-1}) is the predicted probability of the i-th word given the preceding context (w_1, w_2, …, w_{i-1}).
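As a general illustration (not the toolkit's exact implementation), perplexity can be computed with an open causal language model such as GPT-2, where the loss returned by the model is the average cross-entropy over the predicted tokens.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, `loss` is the average cross-entropy over the tokens.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()  # perplexity = exp(average cross-entropy)
```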
Coherence
The collective quality of all sentences. We align this dimension with the DUC quality question of structure and coherence, whereby "the summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic." A GPT model is used as the evaluator; this is a prompt-based evaluation method.
Consistency
The factual alignment between the summary and the summarized source. A factually consistent summary contains only statements that are entailed by the source document. A GPT model is used as the evaluator; this is a prompt-based evaluation method.
Relevance
Selection of important content from the source. The summary should include only important information from the source document. Annotators were instructed to penalize summaries containing redundancies and excess information. A GPT model is used as the evaluator; this is a prompt-based evaluation method.
Fluency
The quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure. A GPT model is used as the evaluator; this is a prompt-based evaluation method.
Source: https://arxiv.org/pdf/2303.16634.pdf
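The four dimensions above are scored with prompt-based evaluation. The sketch below shows one plausible shape of such a prompt for Coherence; the wording and the gpt_call helper are assumptions, and the other dimensions follow the same pattern with their own criteria.

```python
# Illustrative G-Eval-style prompt for the Coherence dimension (assumed wording).
COHERENCE_PROMPT = """You will be given a source document and a summary.
Rate the coherence of the summary on a scale of 1 to 5, where 5 means the summary is
well-structured and well-organized, building from sentence to sentence into a coherent
body of information about the topic. Respond with only the number.

Source:
{source}

Summary:
{summary}

Coherence score:"""

def rate_coherence(source, summary, gpt_call):
    # gpt_call is a hypothetical callable wrapping the GPT evaluator model.
    return int(gpt_call(COHERENCE_PROMPT.format(source=source, summary=summary)).strip())
```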
Uncertainty
Uncertainty refers to the LLM's lack of confidence in the correctness or accuracy of its generated text. When an LLM is presented with a prompt, it can generate multiple possible continuations or completions. The uncertainty associated with each completion reflects the LLM's internal assessment of how likely each continuation is to be the most appropriate one.
Structural Uncertainty: We use normalized entropy to calculate how uncertain the model was in each token selection. If the model leans heavily towards one answer, the entropy is low. But if it's torn between multiple options, entropy spikes. The normalization step ensures we're comparing things consistently across different prompts.
Conceptual Uncertainty: For each sampled token in the response, we create a 'partial' version of the potential response up to that token. Each of these partial responses is transformed into an embedding. We then measure the distance between this partial response and the model's final, complete response. This tells us how the model's thinking evolves as it builds up its answer.
The structural uncertainty uses the normalized entropy H_normalized = -Σ_i p_i log(p_i) / log(N), where p_i is the probability of the i-th candidate token and N is the number of candidates considered.
The conceptual uncertainty is summarized as MCD = (1/C) * Σ_{d ∈ D} d, where MCD is the mean cosine distance across all choice embeddings, C is the number of choices, and D is the set of average cosine distances calculated for each partial response relative to its complete response.
If Structural Uncertainty is low but Conceptual Uncertainty is high, the model is clear about the tokens it selects but varies significantly in the overall messages it generates. This could imply that the model understands the syntax well but struggles with maintaining a consistent message.
Conversely, high Structural Uncertainty and low Conceptual Uncertainty could indicate that the model is unsure at the token-level but consistent in the overall message. Here, the model knows what it wants to say but struggles with how to say it precisely.
If both are high or both are low, it may suggest that the token-level uncertainty and overall message uncertainty are strongly correlated for the specific task, with both being well-defined or both lacking clarity.
Source: https://www.watchful.io/blog/decoding-llm-uncertainties-for-better-predictability
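A minimal sketch of both quantities, assuming the per-step candidate-token probabilities and an embedding function (embed_fn, hypothetical) are available; conceptual_uncertainty returns the average distance for a single sampled response, and MCD is obtained by averaging this value over the C sampled choices.

```python
import numpy as np

def structural_uncertainty(token_probs_per_step):
    """Average normalized entropy of the candidate-token distribution at each step."""
    scores = []
    for p in token_probs_per_step:
        p = np.asarray(p, dtype=float)
        p = p / p.sum()
        entropy = -np.sum(p * np.log(p + 1e-12))
        n = len(p)
        scores.append(entropy / np.log(n) if n > 1 else 0.0)  # normalize by max entropy
    return float(np.mean(scores))

def conceptual_uncertainty(partial_responses, full_response, embed_fn):
    """Average cosine distance between each partial response embedding and the
    embedding of the complete response (one sampled choice)."""
    full = np.asarray(embed_fn(full_response))
    dists = []
    for partial in partial_responses:
        v = np.asarray(embed_fn(partial))
        cos = np.dot(v, full) / (np.linalg.norm(v) * np.linalg.norm(full))
        dists.append(1.0 - cos)
    return float(np.mean(dists))
```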
Token Importance
This metric estimates the importance of individual tokens in a text prompt by comparing the original embedding with an embedding computed after a single token is removed.
Methods
Tokenization: Break the prompt down into its individual words or 'tokens.'
Text Embedding: Convert the text prompt into a numerical format, known as an 'embedding.'
Ablation and Re-Embedding: Remove each token one by one, create a new embedding, and then compare it to the original.
Importance Estimation: The degree of difference gives us a rough idea of each token's importance.
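A minimal sketch of these steps, using simple whitespace tokenization and a hypothetical embed_fn (e.g. a wrapper around an embeddings API); the real implementation may tokenize and compare embeddings differently.

```python
import numpy as np

def token_importance(prompt, embed_fn):
    """Estimate each token's importance by removing it and measuring how far the
    prompt embedding moves (cosine distance to the original embedding)."""
    tokens = prompt.split()                       # simple whitespace tokenization
    original = np.asarray(embed_fn(prompt))

    importances = []
    for i, token in enumerate(tokens):
        ablated_prompt = " ".join(tokens[:i] + tokens[i + 1:])   # drop one token
        ablated = np.asarray(embed_fn(ablated_prompt))
        cos = np.dot(original, ablated) / (
            np.linalg.norm(original) * np.linalg.norm(ablated))
        importances.append((token, 1.0 - cos))    # larger distance => more important
    return importances
```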
When It May Be Useful
High-Quality Embeddings: If the embeddings are robust, the token importance estimates tend to be more reliable.
Simple, Unambiguous Prompts: Clear and straightforward prompts are where this method seems to offer the most reliable results.
Limitations
Complex Prompts: Ambiguity or complexity in the prompt can lead to less reliable estimates.
Atypical Prompt Lengths: Very short or very long prompts. The results for these tend to remain correlated but deviate somewhat more than for prompts of "average" length.
Models Used
GPT-2 embeddings
GPT-3 (ada-002) embeddings
GPT-3.5 (ada-002) embeddings
GPT-4 (not compatible)
Cost Calculation
For the chosen model, the cost depends primarily on the number of tokens in the prompt and in the response. We have implemented the following expression to estimate the cost for the Chain of Thought (CoT) and Thread of Thought (ThoT) techniques:
total_cost = ((input_tokens / 1000) * prompt_price_per_1000_tokens) + ((output_tokens / 1000) * response_price_per_1000_tokens)
where input_tokens is the number of tokens in the entered prompt and output_tokens is the number of tokens in the generated response.
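A minimal sketch of this calculation, using tiktoken to count tokens; the model name and the per-1000-token prices are placeholders to be replaced with the current rates for the chosen model.

```python
import tiktoken

def estimate_cost(prompt, response, model_name="gpt-4",
                  prompt_price_per_1000_tokens=0.03,
                  response_price_per_1000_tokens=0.06):
    # Prices above are placeholders, not authoritative rates.
    enc = tiktoken.encoding_for_model(model_name)
    input_tokens = len(enc.encode(prompt))
    output_tokens = len(enc.encode(response))
    return ((input_tokens / 1000) * prompt_price_per_1000_tokens
            + (output_tokens / 1000) * response_price_per_1000_tokens)
```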
Multi-modal RAG
Image Retrieval RAG
We have implemented this feature by importing the necessary modules: base64 for image encoding, BytesIO for handling byte streams, and Image from PIL for image processing.
Steps
The main class includes three methods:
encode_image(self, image): This method takes an image file as input, opens it, determines its format, saves it to a BytesIO buffer in the same format, and then encodes this buffer into a Base64 string. The Base64 string is then returned.
config(self, messages, modelName): This method interacts with the AzureChatOpenAI class, sending the messages to the AI model deployed on Azure. The model used is determined by the modelName parameter. It returns the AI-generated response.
image_rag(self, payload): This is the main method of the class. It takes a payload containing text and file data. It uploads each file to Azure storage, encodes the image into a Base64 string, and adds this to the messages list alongside the text. It then sends these messages to the AI model via the config method and returns the response.
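The sketch below approximates the encode_image step and the construction of the vision-style message list; the exact message schema expected by the deployed Azure OpenAI model, and the PNG data-URL type, are assumptions.

```python
import base64
from io import BytesIO
from PIL import Image

def encode_image(image_file):
    """Open an image, re-save it into an in-memory buffer in its own format,
    and return the Base64-encoded string."""
    img = Image.open(image_file)
    buffer = BytesIO()
    img.save(buffer, format=img.format)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def build_image_messages(text, image_file):
    # Builds a vision-style chat payload; a PNG data URL is assumed here.
    b64 = encode_image(image_file)
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]
```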
Video Retrieval RAG
The Video RAG pipeline uses various libraries to extract and analyze content from a video file. It primarily focuses on converting video frames into Base64-encoded images and extracting audio to generate a transcript. It utilizes cv2 for video processing, moviepy for audio extraction, and speech_recognition for converting audio to text.
Steps
The main class includes three methods:
video_rag(payload): This method:
Receives a payload containing prompt and video file.
Processes the video to extract frames and audio using the process_video function.
Converts the audio to text using convert_audio_to_text.
Interacts with an Azure OpenAI model to generate a response based on the extracted frames and audio transcript.
process_video(video_path, seconds_per_frame): Handles video processing by:
Extracting frames at specified intervals.
Extracting audio and saving it as a WAV file.
convert_audio_to_text(audio_path): Uses speech recognition to transcribe audio into text.
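The sketch below approximates process_video and convert_audio_to_text; the frame-sampling details, file naming, and choice of the free Google Web Speech backend are assumptions rather than the exact implementation.

```python
import base64
import cv2                                   # frame extraction
import speech_recognition as sr              # audio transcription
from moviepy.editor import VideoFileClip     # audio extraction

def process_video(video_path, seconds_per_frame=2):
    """Return Base64-encoded JPEG frames sampled every `seconds_per_frame` seconds,
    plus the path of the extracted WAV audio track."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    step = max(1, int(cap.get(cv2.CAP_PROP_FPS) * seconds_per_frame))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            ok, buffer = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode("utf-8"))
        index += 1
    cap.release()

    audio_path = video_path.rsplit(".", 1)[0] + ".wav"
    VideoFileClip(video_path).audio.write_audiofile(audio_path)
    return frames, audio_path

def convert_audio_to_text(audio_path):
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)
```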